.Dd November 30, 2024
.Dt LLAMAFILER 1
.Os Mozilla Ocho
.Sh NAME
.Nm llamafiler
.Nd fast reliable large language model server
.Sh SYNOPSIS
.Nm
.Fl m Ar model.gguf
.Op flags...
.Sh DESCRIPTION
.Nm
llamafiler is an HTTP server for Large Language Models (LLMs). It
includes a web GUI for both chatbot and text completion. It can be your
OpenAI API compatible embeddings / completions / chat completions
server. It's able to more intelligently recycle context windows across
multiple slots serving multiple clients.
.Sh OPTIONS
The following options are available:
.Bl -tag -width indent
.It Fl Fl version
Print version and exit.
.It Fl h , Fl Fl help
Show help message and exit.
.It Fl m Ar FNAME , Fl Fl model Ar FNAME
Path of GGUF model weights. Each server process is currently limited to
serving only one model. If you need to host multiple models, then it's
recommended that you run multiple instances of llamafiler behind a
reverse proxy such as NGINX or Redbean.
.It Fl mm Ar FNAME , Fl Fl mmproj Ar FNAME
Path of vision model weights.
.It Fl Fl db Ar FILE
Specifies path of sqlite3 database.
.Pp
The default is
.Pa ~/.llamafile/llamafile.sqlite3
.It Fl ngl Ar N , Fl Fl gpu-layers Ar N , Fl Fl n-gpu-layers Ar N
Specifies number of layers to offload to GPU.
.Pp
This flag must be passed in order to use GPU on systems with NVIDIA or
AMD GPUs. If you're confident that you have enough VRAM, then you can
pass
.Fl ngl Ar 999
to enable full offloading, since this number is automatically downtuned
to however many number of layers the model has. If VRAM is limited, then
the
.Fl Fl verbose
flag may be passed to learn how many layers the model has, e.g. 35,
which can then be down-tuned until the out of memory error goes away.
.Pp
On Apple Silicon systems with Metal, GPU offloading is enabled by
default. Since these GPUs use unified memory, they're treated as having
a single layer; therefore, using values higher than 1 will be treated as
1. You can pass
.Fl ngl Ar 0
to disable GPU offloading and run in CPU mode on Apple Metal systems.
.It Fl l Ar HOSTPORT , Fl Fl listen Ar HOSTPORT
Specifies the local [HOST:]PORT on which the HTTP server should listen.
By default this is 0.0.0.0:8080 which means llamafiler will bind to port
8080 on every locally available IPv4 network interface. This option may
currently only be specified once.
.It Fl c Ar TOKENS , Fl Fl ctx-size Ar TOKENS
Specifies context size. This specifies how long a completion can get
before it runs out of space. It defaults to 8k which means 8192 tokens.
Many models support a larger context size, like 128k, but that'll need
much more RAM or VRAM per slot. If this value is larger than the trained
context size of the model, it'll be tuned down to the maximum. If this
value is 0 or negative, the maximum number of tokens will be used.
.It Fl s Ar COUNT , Fl Fl slots Ar COUNT
Specifies how many slots to maintain. This defaults to 1. Slots are used
by chat completions requests. When such a request comes in, the client
needs to take control of a slot. When the completion is finished, the
slot is relinquished back to the server. HTTP clients will wait for a
slot to be relinquished if none are available. Tuning this parameter to
nicely fit available RAM or VRAM can help you manage your server
resources, and control how much completion parallelism can happen.
Please note that
.Fl Fl ctx-size
has a strong influence on how many slots can be created.
.It Fl Fl decay-delay Ar INT
Number of seconds a context window slot needs to be inactive before the
system starts to strongly consider giving it to other clients. The
default is 300 which is five minutes.
.It Fl Fl decay-growth Ar FLOAT
Sets slot decay growth factor. Context window slots are assigned in a
least recently used fashion, based on the formula
.EQ
age + e sup {growth * (age - delay)}
.EN
.It Fl p Ar TEXT , Fl Fl prompt Ar TEXT , Fl Fl system-prompt Ar TEXT
Specifies system prompt. This value is passed along to the web frontend.
.It Fl Fl no-display-prompt
Hide system prompt from web user interface.
.It Fl Fl nologo
Hide llamafile logo icon from web ui.
.It Fl Fl url-prefix Ar URLPREFIX
Specifies a URL prefix (subdirectory) under which the HTTP server will
make the API accessible, e.g. /lamafiler. Useful when running llamafiler
behind a reverse proxy such as NGINX or Redbean. By default, this is set
to / (root).
.It Fl Fl verbose
Enable logging of diagnostic information. This flag is useful for
learning more about the model and hardware. It can also be helpful for
troubleshooting errors. We currently recommend that this flag be avoided
in production since the llama.cpp logger may disrupt thread cancelation.
.It Fl w Ar N , Fl Fl workers Ar N
Number of HTTP client handling threads.
.It Fl Fl trust Ar CIDR
Adds a network to the trusted network list. This argument is specified
in the form IPV4/MASKBITS, e.g. 192.168.0.0/24. By default, all clients
are untrusted, which means they're subject to token bucket throttling,
and additional security precautions that may cause request handling to
go slightly slower. Therefore this flag is important to use if you want
to accurately benchmark llamafiler, since the server will otherwise see
the benchmark as a DDOS and deprioritize its traffic accordingly.
.It Fl Fl ip-header Ar STR
If this flag is passed a value, e.g. X-Forwarded-For, then any trusted
may send this header to your llamafile server to let it know what the
true effective client IPv4 address actually is. After this happens the
default security restrictions, e.g. token bucket, will be measured and
applied against that IPv4 address and its adjacent networks.
.It Fl Fl token-rate Ar N
Specifies how many times per second a token is dropped in each bucket.
This setting is used to define a limitation on how many TCP connects and
HTTP messages each chunk of the IPv4 address space is permitted to send
to llamafiler over a sustained period of time. The default token rate is
1, which means that, on a long enough timeline, a class-C network will
be deprioritized if it sends more than one request per second. No real
penalty actually applies though until the server runs out of resources,
e.g. HTTP request workers.
.It Fl Fl token-burst Ar N
Specifies how many HTTP requests and TCP connects a given slice of the
IPv4 address space is permitted to send within a short period of time,
before token bucket restrictions kick in, and cause the client to be
deprioritized. By default, this value is set to 100. It may be tuned to
any value between 1 and 127 inclusive.
.It Fl Fl token-cidr Ar N
Specifies IPv4 address space granularity of token bucket algorithm, in
network bits. By default, this value is set to 24 which means individual
IPv4 addresses are viewed as being representative members of a class-C
network, or in other words, each group of 256 IPv4 addresses is lumped
together. If one IP in the group does something bad, then bad things
happen to all the other IPv4 addresses in that granule. This number may
be set to any integer between 3 and 32 inclusive. Specifying a higher
number will trade away system memory to increase network specificity.
For example, using 32 means that 4 billion individual token buckets will
be created. By default, a background thread drops one token in each
bucket every second, so that could potentially be a lot of busy work. A
value of three means that everyone on the Internet who talks to your
server will have to fight over only eight token buckets in total.
.It Fl Fl unsecure
Disables sandboxing. By default, llamafiler puts itself in a SECCOMP BPF
sandbox, so that even if your server gets hacked in the worst possible
way (some kind of C++ memory bug) then there's very little damage an
attacker will be able to do. This works by restricting system calls
using Cosmopolitan Libc's implementation of pledge() which is currently
only supported on Linux (other OSes will simply be unsecured by
default). The pledge security policy that's used by default is "stdio
anet" which means that only relatively harmless system calls like
read(), write(), and accept() are allowed once the server has finished
initializing. It's not possible for remotely executed code to do things
like launch subprocesses, read or write to the filesystem, or initiate a
new connection to a server.
.It Fl k Ar N , Fl Fl keepalive Ar N
Specifies the TCP keepalive interval in seconds. This value is passed
along to both TCP_KEEPIDLE and TCP_KEEPINTVL if they're supported by the
host operating system. If this value is greater than 0, then the the
SO_KEEPALIVE and TCP_NODELAY options are enabled on network sockets, if
supported by the host operating system. The default keepalive is 5.
.It Fl Fl http-obuf-size Ar N
Size of HTTP output buffer size, in bytes. Default is 1048576.
.It Fl Fl http-ibuf-size Ar N
Size of HTTP input buffer size, in bytes. Default is 1048576.
.It Fl Fl chat-template Ar NAME
Specifies or overrides chat template for model.
.Pp
Normally the GGUF metadata tokenizer.chat_template will specify this
value for instruct models. This flag may be used to either override the
chat template, or specify one when the GGUF metadata field is absent,
which effectively forces the web ui to enable chatbot mode.
.Pp
Supported chat template names are: chatml, llama2, llama3, mistral
(alias for llama2), phi3, zephyr, monarch, gemma, gemma2 (alias for
gemma), orion, openchat, vicuna, vicuna-orca, deepseek, command-r,
chatglm3, chatglm4, minicpm, deepseek2, or exaone3.
.Pp
It is also possible to pass the jinja2 template itself to this argument.
Since llamafiler doesn't currently support jinja2, a heuristic will be
used to guess which of the above templates the template represents.
.It Fl Fl completion-mode
Forces web ui to operate in completion mode, rather than chat mode.
Normally the web ui chooses its mode based on the GGUF metadata. Base
models normally don't define tokenizer.chat_template whereas instruct
models do. If it's a base model, then the web ui will automatically use
completion mode only, without needing to specify this flag. This flag is
useful in cases where a prompt template is defined by the gguf, but it
is desirable for the chat interface to be disabled.
.It Fl Fl db-startup-sql Ar CODE
Specifies SQL code that should be executed whenever connecting to the
SQLite database. The default is the following code, which enables the
write-ahead log.
.Bd -literal -offset indent
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;
.Ed
.It Fl Fl reserve-tokens Ar N
Percent of context window to reserve for predicted tokens. When the
server runs out of context window, old chat messages will be forgotten
until this percent of the context is empty. The default is 15%. If this
is specified as a floating point number, e.g. 0.15, then it'll be
multiplied by 100 to get the percent.
.El
.Sh EXAMPLES
Here's an example of how you might start this server:
.Pp
.Dl "llamafiler -m all-MiniLM-L6-v2.F32.gguf"
.Pp
Here's how to send a tokenization request:
.Pp
.Dl "curl -v http://127.0.0.1:8080/tokenize?prompt=hello+world"
.Pp
Here's how to send a embedding request:
.Pp
.Dl "curl -v http://127.0.0.1:8080/embedding?content=hello+world"
.Sh DOCUMENTATION
Read our Markdown documentation for additional help and tutorials. See
llamafile/server/doc/index.md in the source repository on GitHub.
.Sh SEE ALSO
.Xr llamafile 1 ,
.Xr whisperfile 1
